The advent of reusable rockets has fundamentally altered the economics of space exploration, with autonomous powered descent and landing being the critical enabling technology. This maneuver presents a formidable challenge in control theory, characterized by high-dimensional continuous state-action spaces, unstable nonlinear dynamics, and strict terminal constraints. This paper presents a comprehensive theoretical framework for developing a stable control policy for vertical rocket landing using Deep Reinforcement Learning (DRL). We situate the problem within the context of modern policy optimization algorithms, reviewing the theoretical underpinnings of methods such as Proximal Policy Optimization (PPO), Trust Region Policy Optimization (TRPO), and Soft Actor-Critic (SAC). We analyze the critical role of environment design, reward shaping theory, and numerical simulation in the successful application of DRL to this domain. By systematically comparing existing methodologies, we identify key research gaps, including the sim-to-real transfer problem, the sample efficiency of on-policy methods, and the absence of formal safety guarantees. This paper validates the theoretical feasibility of using model-free DRL to solve high-stakes aerospace control problems and proposes a structured roadmap for future research in robust, verifiable autonomous guidance and control systems.
Introduction
The paper examines autonomous vertical rocket landing as a central challenge in modern space exploration, driven by the need for economically sustainable, reusable launch systems. Reusability—pioneered by companies such as SpaceX—has dramatically reduced launch costs but introduces an exceptionally difficult control problem involving nonlinear, unstable, high-dimensional dynamics, strict terminal constraints, and strong coupling between translational and rotational motion.
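To make the coupling between translational and rotational motion concrete, consider a commonly used simplified planar rigid-body model (an illustrative assumption, not the specific dynamics treated in the paper):

\begin{aligned}
m\ddot{x} &= T\sin(\theta + \delta), \\
m\ddot{y} &= T\cos(\theta + \delta) - mg, \\
J\ddot{\theta} &= -\,l\,T\sin(\delta),
\end{aligned}

where x and y are the horizontal and vertical positions, theta the pitch angle, T the thrust magnitude, delta the engine gimbal deflection, m the vehicle mass, J the moment of inertia, and l the lever arm from the center of mass to the engine. Because the gimbal deflection enters both the translational and the rotational equations, every attitude correction also perturbs the trajectory, which is precisely the coupling that makes the terminal position, velocity, and attitude constraints difficult to satisfy simultaneously.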
Traditional model-based control methods (e.g., PID, LQR, convex optimization) have been effective in aerospace applications but suffer from brittleness under modeling errors and environmental uncertainty. This motivates the use of Deep Reinforcement Learning (DRL), a model-free approach capable of learning complex control policies directly from interaction data. The paper frames rocket landing as a theoretical DRL problem, emphasizing environment design, reward shaping, and algorithm selection.
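To ground the discussion of environment design, the following is a minimal, hypothetical sketch of how the landing task could be encoded as an OpenAI Gym-style environment. The state layout, dynamics constants, action bounds, and reward terms are illustrative assumptions rather than the configuration studied in any of the reviewed works.

import numpy as np
import gym
from gym import spaces


class PlanarLanderEnv(gym.Env):
    """Minimal planar rocket-landing environment (illustrative sketch only)."""

    def __init__(self, dt=0.05, g=9.81, mass=1.0, inertia=1.0, arm=1.0, max_thrust=20.0):
        super().__init__()
        self.dt, self.g, self.m, self.J, self.l = dt, g, mass, inertia, arm
        self.max_thrust = max_thrust  # assumed thrust-to-weight ratio > 1
        # Observation: [x, y, vx, vy, theta, omega]
        high = np.full(6, np.inf, dtype=np.float32)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        # Action: [throttle in [0, 1], gimbal deflection in [-0.2, 0.2] rad] (assumed bounds)
        self.action_space = spaces.Box(
            low=np.array([0.0, -0.2], dtype=np.float32),
            high=np.array([1.0, 0.2], dtype=np.float32),
            dtype=np.float32,
        )

    def reset(self):
        # Start offset from the pad, descending, with a small attitude error.
        self.s = np.array(
            [np.random.uniform(-1.0, 1.0), 10.0, 0.0, -2.0,
             np.random.uniform(-0.1, 0.1), 0.0],
            dtype=np.float32,
        )
        return self.s.copy()

    def step(self, action):
        throttle, delta = np.clip(action, self.action_space.low, self.action_space.high)
        T = throttle * self.max_thrust
        x, y, vx, vy, th, om = self.s
        # Simplified planar dynamics: the gimbal couples translation and rotation.
        ax = T * np.sin(th + delta) / self.m
        ay = T * np.cos(th + delta) / self.m - self.g
        alpha = -self.l * T * np.sin(delta) / self.J
        # Semi-implicit Euler integration (a common choice for numerical stability).
        vx, vy, om = vx + ax * self.dt, vy + ay * self.dt, om + alpha * self.dt
        x, y, th = x + vx * self.dt, y + vy * self.dt, th + om * self.dt
        self.s = np.array([x, y, vx, vy, th, om], dtype=np.float32)
        done = bool(y <= 0.0)
        # Small fuel penalty each step; sparse terminal reward for a soft, upright touchdown.
        reward = -0.01 * float(throttle)
        if done:
            soft = abs(vy) < 1.0 and abs(vx) < 0.5 and abs(th) < 0.1
            reward += 100.0 if soft else -100.0
        return self.s.copy(), float(reward), done, {}

An environment of this form can then be handed to an off-the-shelf PPO or SAC implementation, which is what makes the algorithm comparisons discussed next possible.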
A comprehensive literature review covers key DRL algorithms—TRPO, PPO, TD3, and SAC—highlighting their theoretical foundations, stability mechanisms, and trade-offs between robustness and sample efficiency. It also discusses standardized simulation environments, numerical stability in physics simulation, and principled reward design via potential-based shaping.
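As a concrete illustration of the potential-based shaping mentioned above, the sketch below adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward, following the policy-invariance result of Ng, Harada, and Russell. The particular potential function (distance, speed, and attitude terms) and its coefficients are assumptions chosen for illustration, not values prescribed by the reviewed literature.

import numpy as np

GAMMA = 0.99  # discount factor used by the learning algorithm (assumed)


def potential(state):
    """Illustrative potential: being near the pad, slow, and upright scores higher."""
    x, y, vx, vy, theta, omega = state
    return -np.hypot(x, y) - 0.1 * np.hypot(vx, vy) - 0.5 * abs(theta)


def shaped_reward(env_reward, state, next_state, done):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s).

    Because the shaping term is a discounted potential difference, it densifies
    the learning signal without changing which policies are optimal.
    """
    phi_next = 0.0 if done else potential(next_state)  # terminal potential conventionally zero
    return env_reward + GAMMA * phi_next - potential(state)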
The paper identifies major research gaps: the sim-to-real transfer problem, the trade-off between stability and sample efficiency in DRL algorithms, the lack of systematic reward engineering methods, and the absence of formal safety and verification guarantees for learned policies.
Finally, it proposes a future research agenda along three axes:
enhancing DRL through hybrid model-based/model-free methods and automated reward learning,
expanding the problem to more realistic 3D, multi-agent, and full-mission scenarios, and
integrating safety, formal verification, and causal reasoning to enable robust, certifiable autonomous rocket landing systems.
Conclusion
This paper has presented a comprehensive theoretical framework for addressing the autonomous vertical rocket landing problem using Deep Reinforcement Learning. By situating this complex aerospace challenge within the context of modern control theory and machine learning, we have reviewed the evolution of pertinent policy optimization algorithms, analyzed the theoretical requirements for high-fidelity simulation, and underscored the critical role of potential-based reward shaping in guiding the learning process. The systematic comparison of methodologies and the subsequent identification of critical research gaps, spanning sim-to-real transfer, the stability-efficiency trade-off, reward function design, and the profound need for safety verification, collectively illuminate the current limitations and future potential of the field. The proposed multi-axis roadmap for future work, which advocates for hybrid methods, expanded problem scopes, and the integration of formal and causal methods, provides a structured pathway for advancing DRL from a powerful simulation tool to a viable technology for real-world, safety-critical autonomous systems. This work validates the immense theoretical potential of DRL for solving formidable control problems and charts a course for continued innovation in autonomous guidance and control.